AITopics | arthur guez

Learning values across many orders of magnitude

Neural Information Processing SystemsMar-17-2026, 08:27:32 GMT

Most learning algorithms are not invariant to the scale of the signal that is being approximated. We propose to adaptively normalize the targets used in the learning updates. This is important in value-based reinforcement learning, where the magnitude of appropriate value approximations can change over time when we update the policy of behavior. Our main motivation is prior work on learning to play Atari games, where the rewards were clipped to a predetermined range. This clipping facilitates learning across many different games with a single learning algorithm, but a clipped reward function can result in qualitatively different behavior. Using adaptive normalization we can remove this domain-specific heuristic without diminishing overall performance.

artificial intelligence, machine learning, proceedings, (4 more...)

Neural Information Processing Systems

Industry: Leisure & Entertainment > Games > Computer Games (0.62)

Technology: Information Technology > Artificial Intelligence > Machine Learning (1.00)

Add feedback

9381fc93ad66f9ec4b2eef71147a6665-Paper.pdf

Neural Information Processing SystemsFeb-9-2026, 09:16:22 GMT

information, learning, value prediction, (16 more...)

Neural Information Processing Systems

Country:

North America > United States > Massachusetts > Middlesex County > Cambridge (0.04)
North America > Canada (0.04)

Industry: Leisure & Entertainment > Games (0.69)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)

Add feedback

Learning values across many orders of magnitude

Neural Information Processing SystemsNov-21-2025, 14:42:09 GMT

Most learning algorithms are not invariant to the scale of the signal that is being approximated. We propose to adaptively normalize the targets used in the learning updates. This is important in value-based reinforcement learning, where the magnitude of appropriate value approximations can change over time when we update the policy of behavior. Our main motivation is prior work on learning to play Atari games, where the rewards were clipped to a predetermined range. This clipping facilitates learning across many different games with a single learning algorithm, but a clipped reward function can result in qualitatively different behavior. Using adaptive normalization we can remove this domain-specific heuristic without diminishing overall performance.

learning value, magnitude, name change, (3 more...)

Neural Information Processing Systems

Industry: Leisure & Entertainment > Games > Computer Games (0.62)

Technology: Information Technology > Artificial Intelligence > Machine Learning (1.00)

Add feedback

Value-driven Hindsight Modelling

Neural Information Processing SystemsAug-15-2025, 04:03:15 GMT

V alue estimation is a critical component of the reinforcement learning (RL) paradigm.

information, learning, value prediction, (16 more...)

Neural Information Processing Systems

Country:

North America > United States > Massachusetts > Middlesex County > Cambridge (0.04)
North America > Canada (0.04)

Industry: Leisure & Entertainment > Games (0.69)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)

Add feedback

Learning values across many orders of magnitude

Neural Information Processing SystemsFeb-11-2025, 19:10:24 GMT

Most learning algorithms are not invariant to the scale of the signal that is being approximated. We propose to adaptively normalize the targets used in the learning updates. This is important in value-based reinforcement learning, where the magnitude of appropriate value approximations can change over time when we update the policy of behavior. Our main motivation is prior work on learning to play Atari games, where the rewards were clipped to a predetermined range. This clipping facilitates learning across many different games with a single learning algorithm, but a clipped reward function can result in qualitatively different behavior.

arthur guez, learning value, magnitude, (1 more...)

Neural Information Processing Systems

Industry: Leisure & Entertainment > Games > Computer Games (0.68)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (0.31)

Add feedback

AlphaZeroES: Direct score maximization outperforms planning loss minimization

Martin, Carlos, Sandholm, Tuomas

arXiv.org Artificial IntelligenceJun-12-2024

Planning at execution time has been shown to dramatically improve performance for agents in both single-agent and multi-agent settings. A well-known family of approaches to planning at execution time are AlphaZero and its variants, which use Monte Carlo Tree Search together with a neural network that guides the search by predicting state values and action probabilities. AlphaZero trains these networks by minimizing a planning loss that makes the value prediction match the episode return, and the policy prediction at the root of the search tree match the output of the full tree expansion. AlphaZero has been applied to both single-agent environments (such as Sokoban) and multi-agent environments (such as chess and Go) with great success. In this paper, we explore an intriguing question: In single-agent environments, can we outperform AlphaZero by directly maximizing the episode score instead of minimizing this planning loss, while leaving the MCTS algorithm and neural architecture unchanged? To directly maximize the episode score, we use evolution strategies, a family of algorithms for zeroth-order blackbox optimization. Our experiments indicate that, across multiple environments, directly maximizing the episode score outperforms minimizing the planning loss.

agent, algorithm, alphazero, (15 more...)

arXiv.org Artificial Intelligence

2406.08687

Country:

North America > United States > Pennsylvania > Allegheny County > Pittsburgh (0.04)
Europe > United Kingdom > England > Oxfordshire > Oxford (0.04)
Europe > Switzerland > Basel-City > Basel (0.04)
(3 more...)

Genre: Research Report > New Finding (0.46)

Industry: Leisure & Entertainment > Games (1.00)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Search (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Planning & Scheduling (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (0.93)

Add feedback

Super-Exponential Regret for UCT, AlphaGo and Variants

Orseau, Laurent, Munos, Remi

arXiv.org Artificial IntelligenceMay-17-2024

We improve the proofs of the lower bounds of Coquelin and Munos (2007) that demonstrate that UCT can have $\exp(\dots\exp(1)\dots)$ regret (with $\Omega(D)$ exp terms) on the $D$-chain environment, and that a `polynomial' UCT variant has $\exp_2(\exp_2(D - O(\log D)))$ regret on the same environment -- the original proofs contain an oversight for rewards bounded in $[0, 1]$, which we fix in the present draft. We also adapt the proofs to AlphaGo's MCTS and its descendants (e.g., AlphaZero, Leela Zero) to also show $\exp_2(\exp_2(D - O(\log D)))$ regret.

algorithm, d-chain environment, exp 2, (14 more...)

arXiv.org Artificial Intelligence

2405.04407

Country: North America > United States > Virginia > Arlington County > Arlington (0.04)

Genre: Research Report (0.40)

Industry:

Leisure & Entertainment > Games > Go (0.72)
Information Technology > Software (0.62)

Technology:

Information Technology > Artificial Intelligence > Games > Go (0.62)
Information Technology > Artificial Intelligence > Representation & Reasoning > Planning & Scheduling (0.49)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.47)

Add feedback

AI planning in the imagination: High-level planning on learned abstract search spaces

Martin, Carlos, Sandholm, Tuomas

arXiv.org Artificial IntelligenceDec-2-2023

Search and planning algorithms have been a cornerstone of artificial intelligence since the field's inception. Giving reinforcement learning agents the ability to plan during execution time has resulted in significant performance improvements in various domains. However, in real-world environments, the model with respect to which the agent plans has been constrained to be grounded in the real environment itself, as opposed to a more abstract model which allows for planning over compound actions and behaviors. We propose a new method, called PiZero, that gives an agent the ability to plan in an abstract search space that the agent learns during training, which is completely decoupled from the real environment. Unlike prior approaches, this enables the agent to perform high-level planning at arbitrary timescales and reason in terms of compound or temporally-extended actions, which can be useful in environments where large numbers of base-level micro-actions are needed to perform relevant macro-actions. In addition, our method is more general than comparable prior methods because it seamlessly handles settings with continuous action spaces, combinatorial action spaces, and partial observability. We evaluate our method on multiple domains, including the traveling salesman problem, Sokoban, 2048, the facility location problem, and Pacman. Experimentally, it outperforms comparable prior methods without assuming access to an environment simulator at execution time.

agent, algorithm, david silver, (13 more...)

arXiv.org Artificial Intelligence

2308.08693

Country: